\documentclass[12pt,titlepage]{article}
\usepackage[a4paper,top=30mm,bottom=30mm,left=20mm,right=20mm]{geometry}
\usepackage{url}
\usepackage{alltt}
\usepackage{xspace}
\usepackage{times}
\usepackage{listings}

\usepackage{verbatim}
%\usepackage[dvips]{hyperref}
\usepackage{optionman}

\newcommand{\GenomeTools}{\textit{GenomeTools}\xspace}
\newcommand{\Readjoiner}{\textit{Readjoiner}\xspace}
\newcommand{\Rdjprefilter}{\textit{Readjoiner Prefilter}\xspace}
\newcommand{\Rdjoverlap}{\textit{Readjoiner Overlap}\xspace}
\newcommand{\Rdjassembly}{\textit{Readjoiner Assembly}\xspace}
\newcommand{\Gtconvertseq}{\textit{Convertseq}\xspace}
\newcommand{\Gtreadreads}{\textit{Readreads}\xspace}
\newcommand{\Gtsuffixerator}{\textit{Suffixerator}\xspace}

\newcommand{\Gtcmd}{\texttt{gt}\xspace}
\newcommand{\Readjoinercmd}{\texttt{gt readjoiner}\xspace}
\newcommand{\Rdjprefiltercmd}{\texttt{gt readjoiner prefilter}\xspace}
\newcommand{\Rdjoverlapcmd}{\texttt{gt readjoiner overlap}\xspace}
\newcommand{\Rdjassemblycmd}{\texttt{gt readjoiner assembly}\xspace}
\newcommand{\Gtconvertseqcmd}{\texttt{gt convertseq}\xspace}
\newcommand{\Gtreadreadscmd}{\texttt{gt dev readreads}\xspace}
\newcommand{\Gtsuffixeratorcmd}{\texttt{gt suffixerator}\xspace}

\newcommand{\minlen}{\ell}

\title{\Huge{\Readjoiner 1.2:\\ User manual.}\\[3mm]
\Large{A fast and memory efficient \\string
 graph-based sequence assembler.}}
\author{\begin{tabular}{c}
         \textit{Giorgio Gonnella}\\
         \textit{Stefan Kurtz}\\[2cm]
         Research Group for Genome Informatics\\
         Center for Bioinformatics\\
         University of Hamburg\\
         Bundesstrasse 43\\
         20146 Hamburg (Germany)\\[1cm]
         \url{gonnella@zbh.uni-hamburg.de}\\
         \url{kurtz@zbh.uni-hamburg.de}\\[1cm]
        \end{tabular}}

\date{18/02/2014}
\begin{document}
\maketitle

\section{Introduction} \label{Introduction}

\Readjoiner is a software pipeline for the \textit{de novo} assembly of
 sequencing readsets, based on the assembly string graph framework
  \cite{MYE:2005}.

\Readjoiner is written in \texttt{C} and it is based on the
\GenomeTools library \cite{genometools}. It has no external library
dependencies and may be compiled on any POSIX-compliant operative system.
 \Readjoiner is implemented as a collection of tools, all compiled in the single
\GenomeTools binary named \Gtcmd.

The \Readjoiner assembly pipeline consists in the following phases:\\[3mm]
\begin{tabular}{lll}
Phase & Tool & Description \\
\textit{Prefiltering} & \Rdjprefiltercmd &
  encode reads, remove contained and ambiguous\\
\textit{Overlap}   & \Rdjoverlapcmd &
  determine all pairs suffix-prefix matches (SPMs) \\
\textit{Assembly}  & \Rdjassemblycmd &
 build the string graph, output the contigs \\
\end{tabular}

\Readjoiner can be installed either from a binary distribution or from the
source code. In both cases, you will have the binary file \texttt{gt}.
The rest of the manual assumes this binary is in the command line path.

The \Readjoiner tools are run using the following syntax:
\texttt{gt readjoiner <tool> <arguments>}, where \texttt{<tool>}
is either prefilter, overlap or assembly.

The arguments are different,
depending on the tool, however some options are common:
\begin{itemize}
\item \Showoption{help} will show an useful help message on the command
line, with a summary of the most important options.
\item \Showoption{help+} will show an expanded help message, in which
further less common options are also listed.
\item \Showoption{v} increases the verbosity.
\item \Showoption{q} suppresses all output messages.
\end{itemize}

\subsection{Installation from the source code}

For installation from the source code, please read the instruction in the
\texttt{INSTALL}
file (in the main directory of the sources). In most cases a 64-bit version
is necessary, so make sure you use the \texttt{64bit=yes} make flag.
External dependencies such as cairo are not necessary for
\Readjoiner, thus if you plan to use only \Readjoiner you may safely choose
to use the \texttt{cairo=no} make flag.

If you use the \texttt{amalgamation=yes} make flag, you will probably notice
a significative speed-up, thus this option is reccomended. Further improvements
in the runtime can be achieved by using \texttt{assert=no}: this skips all
runtime assertions; if you encounter any problem by running \Readjoiner
then you should (re-)compile using assertions before reporting a bug,
as assertions are an useful source of information for the developer.

Using the option \texttt{threads=yes} you can enable multithreading support.
Currently the overlap phase can be run in parallel mode, thus this option is
reccomended if \Readjoiner is run on a multi-core or multi-processor
machine.

\section{Error correction}

Most sequencing technologies generate reads which contain technology-specific
errors. For example, 454 and IonTorrent reads will contain mainly indels in
homopolymeric regions, while Illumina reads will contain mainly substitution
errors. An error correction prior to genome assembly is an essential step,
and it is very important for Readjoiner, that the input reads have been already
processed using an error correction tool.

A number of stand-alone error correction tools exists: an example of a tool
with a good performance on subsitution errors in Quake (Kelley et al., 2010).
For homopolymer errors if a reference sequence is available, have
a look to our error corrector HOP, also a part of GenomeTools.

For susbtitution errors there is also an experimental support in Readjoiner
of k-mer based error correction. If you want to try it,
run \texttt{gt dev seqcorrect -help} for more information.

%\newpage

\section{Prefiltering}

Readjoiner prefiltering step has the following functions:
\begin{itemize}
\item Remove ambiguity-containing reads
\item Remove redundant information (duplicate and suffix/prefix reads)
\item Encode the readset in GtEncseq format
\end{itemize}

\subsection{Describing the read set}

The input to \Rdjprefilter consists of sequencing reads, in
FastQ or Fasta format.
The input is specified using the \Showoption{db} option, which
takes multiple arguments, each argument describes a sequencing
library.

For single-end libraries, the
corresponding argument of \Showoption{db} is simply the filename.
Paired-end libraries are accepted in two variants: \textit{two-files} libraries,
where a file contains the forward reads and a second file the reverse reads
in the same order, and \textit{interleaved} libraries, consisting of a single
file, where the first, third, fifth, etc sequences are the forward reads, the
second, fourth, sixth, etc sequences are the reverse reads.
For two-files libraries, the argument of \Showoption{db} is in the format
\verb|<file1>:<file2>:<mean>[,<std>]| where \verb/<file1|2>/ are
 filenames, \verb|<mean>| is the average
insert size and \verb|<std>| is the standard deviation of insert
sizes (which can be omitted if unknown). For interleaved libraries
the library is specified as \verb|<file>:<mean>[,<std>]|.

Some examples:\\[3mm]
\begin{tabular}{p{0.5\textwidth}p{0.45\textwidth}}
Argument of \Showoption{db} & Meaning:\\
\hline
\Showoptionarg{se.fas}
& single-end library (\texttt{se.fas})\\
\hline
\Showoptionarg{pe.fastq:200}
& interleaved paired-end library with average insert size of 200,
standard deviation unknown\\
\hline
\Showoptionarg{pe\_f.fastq:pe\_r.fastq:10000}
& a paired-end library in two files,
with average insert size of about 10kb,
standard deviation unknown\\
\hline
\Showoptionarg{se.fastq pe.fastq:180,20}
& a single-end library plus
an interleaved paired-end library with average insert size of 180,
standard deviation 20\\
\hline
\Showoptionarg{short.fastq:100 long\_f.fastq:long\_r.fastq:10000,1200 se.fastq}
& an interleaved paired-end library with insert size 100 plus
a two-files paired-end library with insert size 10000, standard
deviation 1200, and a single-end library\\
\hline
\end{tabular}

%\newpage

\subsection{Setting the read set name}

The argument of the option \Showoption{readset}
is the read set name. This will be used as basename
for the output and intermediate files and must
specified in subsequent calls to the \texttt{overlap}
and \texttt{assembly} tools.

\subsection{Contained reads}

Duplicate reads and suffix/prefix reads (i.e. reads which are suffixes or
prefixes of other reads) are contained reads, thus
redundant sequence information, which is not represented in the string graph
\cite{MYE:2005}. Furthermore, our overlap algorithm assumes that the read set is
suffix-prefix-free, thus these reads are recognized and eliminated
in the prefiltering phase.

Reads are hereby considered from both strands,
thus for example a reads which is equal to the reverse complement of another
read is also removed.
Internally contained reads (reads equal to an internal substring of
other reads / reverse complement) are not identified here, but during
the \Rdjoverlap step.

In a paired-end library, if a read is contained, its mate pair is
discarded too, to keep the library balanced.

\subsection{Quality scores}

Readjoiner supports the input formats FastQ and Fasta.
The FastQ format is assumed to use the standard \textit{phred33} scores.
As an alternative, \textit{phred64} scores are also supported; in this case
the option \Showoption{phred64} must be specified. The only limitation is that
it is not possible to mix libraries with \textit{phred33} scores
and libraries with \textit{phred64} scores.

The quality scores can be used for quality-based trimming. In order to activate
this feature, the \Showoption{maxlow} option must be specified, with an
argument $n$, which is the maximal number of low-quality bases acceptable in a
read. If a read has more than $n$ low-quality bases, it is discarded. The
definition of low-quality bases is any base whose associated quality score
is at most $l$, where $l$ is an user-defined parameter, by default 3, which
can be set using the \Showoption{lowqual} option.

\subsection{Ambiguous base calls}

Any read containing ambiguities is discarded as low-quality.
If you want to keep reads containing ambuigities, you will have to split
them at the ambiguity positions or permutate the ambiguities to a
random non-ambiguous code using an external program or script.
In a paired-end library, if a read is discarded, its mate pair is
discarded too, so that the library is still balanced.

%\newpage

\section{The overlap phase}

To construct the string graph, the suffix-prefix matches (SPM) among all pairs
of reads must be calculated. Due to the small size of the DNA alphabet (4),
small random matches from reads originating from different regions of the
original DNA molecule are common. Thus, to avoid spurious matches a minimal
match length pameter $\minlen$ is used.

Transitive edges are not present in the final string graph \cite{MYE:2005}.
\Rdjoverlap matching algorithm allows one to recognize which SPM would
correspond to transitive edges in the graph. Thus the graph can be constructed
only including the irreducible edges, requiring less memory.

The \Rdjoverlap tool computes the list of suffix-prefix matches among
all pairs of reads using a suffix sorting and scanning approach.
Non-relevant suffixes are excluded, such as those shorter
than $\minlen$ or without an initial $k$-mer matching a read or reverse
complement of read.

An index of the SPM-relevant suffixes is constructed during this phase.
The index construction is partitioned in order to limit the
space peak.
\Readjoiner uses heuristics
to automatically calculates a value for this parameter, which is usually
a good choice..
However, a memory limit or number of parts can be given as a parameter
to control this time vs.\ space tradeoff.

If the binary has been compiled with
multithreading support, you can specify the number of threads as
follows: \texttt{gt -j <N> readjoiner overlap <arguments>}.

\subsection{Running the \Rdjoverlap tool}

The input to the \Readjoiner overlap tool is a prefix- and suffix-free readset
in \GenomeTools encoded sequence format, usually the output of \Rdjprefilter.

The following options shall be used:

\begin{Justshowoptions}
\Option{readset}{\Showoptionarg{readset}}{
Specify the readset name. The argument should be the same used for the
prefilter tool.}
\Option{l}{\Showoptionarg{minlen}}{
Specify the minimal SPM length parameter.}
\end{Justshowoptions}

The minimal SPM length parameter is an essential parameter of the procedure.
A too large value will make the pipeline faster, but potentially miss some
important matches. A too small value will make the pipeline too slow, and add
too many spurious overlaps.

We found that the ideal value is in the range
$\approx 55-65$ for 100 bp reads. The assembly phase can be repeated using
different values of the minimal SPM length, without repeating the overlap phase,
as long as these values are larger than the value used in the overlap phase.
Therefore, one can choose a little smaller value, e.g. $\approx 50$ during
the overlap phase and then test the assembly phase using several values
equal or larger than that.

The output of the overlap phase consists in one or more
\texttt{.spm} files, containing a binary encoded
list of the non-redundant irreducible suffix-prefix matches in the readset.
 It is possible to manually examine the results, by converting
this list a text format, using the following command:
\texttt{gt readjoiner spmtest -readset <readsetname> -test showlist}.

\section{The assembly phase}

The \Rdjassembly tool constructs a string graph. Each non-contained read
in the sequencing readset is represented by a pair of vertices modeling
its two extremities (Begin and End vertices). Irreducible SPMs are represented
by directed edges.

Sequencing errors present in the reads lead to characteristic paths
in the graph. The error-correction algorithms described in \cite{Edena}
are implemented and can be applied to recognize dead ends
and p-bubbles and remove them (option \Showoption{errors}).
Finally the graph is traversed and sequences
corresponding to unbranched paths are output.

\subsection{Running the the \Rdjassembly tool}

The input consists in SPM lists in \Readjoiner format
which are the output of \Rdjoverlap.

The following options shall be used:
\begin{Justshowoptions}
\Option{readset}{\Showoptionarg{readset}}{
Specify the readset name, This must be the same used in the overlap phase.}
\Option{l}{\Showoptionarg{minlen}}{
(optional) Specify the minimal SPM length parameter.
The value should be at least as high as the one used for \Rdjoverlap.
If an higher value is used, shorter overlaps are not loaded in the
string graph.}
\Option{errors}{}{
Clean short dead ends and remove p-bubbles. This is reccomended}
\Option{spmfiles}{\Showoptionarg{n}}{
If the overlap phase was run using muliple threads, this option is
mandatory and should be set to the same number of threads used
by the \Option{j} in the overlap phase.
}
\end{Justshowoptions}

\section{Examples}

Consider a dataset containing paired-end reads with an insert size
of 1000bp (st.dev. 120bp)
in the files \texttt{p\_f.fastq} and \texttt{p\_r.fastq}
and single-end reads
in the file \texttt{se.fastq}.

This example assumes the reads have been previously error-corrected,
otherwise the results will be probably very poor.

The first step will be run by \Rdjprefilter.

\begin{footnotesize}
\begin{verbatim}
> gt readjoiner prefilter -readset myreadset \
  -db p_f.fastq:p_r.fastq:1000,120 se.fastq
\end{verbatim}
\end{footnotesize}

The result is output in
\GenomeTools encoded sequence format and consist in a \texttt{readset.esq} file.
The encoded sequence can now be used as input for \Rdjoverlap. We will use
a minimal SPM lenght of 50 for this example and suppose \Readjoiner has been
compiled with multi-threading support and 8 cores are available.

\begin{footnotesize}
\begin{verbatim}
> gt -j 8 readjoiner overlap -readset readset -l 50
\end{verbatim}
\end{footnotesize}

The output will be a set of files \texttt{readset.<N>.spm}, with N in the
range 0 to 7, containing
lists of irreducible non-redundant SPMs in \Readjoiner binary format.
Note that the \Showoption{j} option is specified in a different position
(between \texttt{gt} and \texttt{readjoiner}) than the other options.

The next step consists in running the assembly phase using the
\Rdjassembly tool.

\begin{footnotesize}
\begin{verbatim}
> gt readjoiner assembly -spmfiles 8 -readset readset -l 60
\end{verbatim}
\end{footnotesize}

The \Showoption{spmfiles} option is only necessary if the overlap phase
was run using multiple threads: we used the same value for \Showoption{spmfiles}
which was used for the \Showoption{j} option in the overlap phase.

We used an higher mimimal length value that the one used in the overlap phase.
If the results are not satisfying, you can repeat the assembly phase
using a smaller value, such as \Showoption{l} \Showoptionarg{50} or
 \Showoption{l} \Showoptionarg{55}
(the minimum that can be used is the value used in the overlap phase)
and a larger value, such as \Showoption{l} \Showoptionarg{65} or
 \Showoption{l} \Showoptionarg{70}.

The results are contigs, which are saved in FASTA format
in \texttt{readset.contigs.fas}.
Basic statistics of the assembly results, such as the N50 value
and the total length, are displayed in the standard output.

\begin{thebibliography}{1}

\bibitem{MYE:2005}
Myers, EW. (2005).
\newblock {{T}he fragment assembly string graph}.
\newblock {\em Bioinformatics\/}, {21 Suppl 2}, 79--85.

\bibitem{genometools}
Gremme, G. (2011).
\newblock The \textsc{GenomeTools} genome analysis system.
  \url{http://genometools.org}.

\bibitem{AKO04}
Abouelhoda MI, Kurtz S, and Ohlebusch E.
\newblock Replacing suffix trees with enhanced suffix arrays.
\newblock {\em Journal of Discrete Algorithms}, 2:53--86, 2004.

\bibitem{Edena}
Hernandez D, François P, Farinelli L, Osterås M, and Schrenzel J.
  (2008).
\newblock De novo bacterial genome sequencing: millions of very short reads
  assembled on a desktop computer.
\newblock {\em Genome Res\/}, {18}(5), 802--809.

\bibitem{Readjoiner}
Gonnella G and Kurtz S (2012).
\newblock {Readjoiner. A fast and memory efficient string
 graph-based sequence assembler.}
\newblock {\em BMC Bioinformatics}, {13}:82.

\end{thebibliography}

\end{document}
